multi-modal transformer
- North America > United States > New York > Monroe County > Rochester (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Evaluating and Explaining Earthquake-Induced Liquefaction Potential through Multi-Modal Transformers
Youwai, Sompote, Kitkobsin, Tipok, Leelataviwat, Sutat, Jongpradist, Pornkasem
This study presents an explainable parallel transformer architecture for soil liquefaction prediction that integrates three distinct data streams: spectral seismic encoding, soil stratigraphy tokenization, and site-specific features. The architecture processes data from 165 case histories across 11 major earthquakes, employing Fast Fourier Transform for seismic waveform encoding and principles from large language models for soil layer tokenization. Interpretability is achieved through SHapley Additive exPlanations (SHAP), which decompose predictions into individual contributions from seismic characteristics, soil properties, and site conditions. The model achieves 93.75% prediction accuracy on cross-regional validation sets and demonstrates robust performance through sensitivity analysis of ground motion intensity and soil resistance parameters. Notably, validation against previously unseen ground motion data from the 2024 Noto Peninsula earthquake confirms the model's generalization capabilities and practical utility. Implementation as a publicly accessible web application enables rapid assessment of multiple sites simultaneously. This approach establishes a new framework in geotechnical deep learning where sophisticated multi-modal analysis meets practical engineering requirements through quantitative interpretation and accessible deployment.
- North America (0.68)
- Asia > Japan > Honshū > Chūbu (0.46)
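The spectral encoding step described in the abstract above is concrete enough to sketch. Below is a minimal, illustrative take on FFT-based encoding of a seismic record into a fixed-length feature token; the function name, bin count, and pooling scheme are assumptions, not the authors' implementation.

```python
# A minimal sketch of FFT-based spectral encoding for a seismic waveform,
# assuming the goal is one fixed-length spectral token per record; all names
# and parameters are illustrative.
import numpy as np

def spectral_encode(waveform: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Encode a 1-D waveform as a log-magnitude spectrum pooled into n_bins."""
    spectrum = np.abs(np.fft.rfft(waveform))  # one-sided magnitude spectrum
    log_spec = np.log1p(spectrum)             # compress dynamic range
    # Average-pool the spectrum into a fixed number of frequency bins so that
    # records of different lengths map to tokens of the same dimension.
    edges = np.linspace(0, len(log_spec), n_bins + 1, dtype=int)
    return np.array([log_spec[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])

# Example: a synthetic 10 s record sampled at 100 Hz becomes a 64-d token.
rng = np.random.default_rng(0)
token = spectral_encode(rng.standard_normal(1000))
print(token.shape)  # (64,)
```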
Multi-Modal Transformer and Reinforcement Learning-based Beam Management
Ghassemi, Mohammad, Zhang, Han, Afana, Ali, Sediq, Akram Bin, Erol-Kantarci, Melike
Beam management is an important technique to improve signal strength and reduce interference in wireless communication systems. Recently, there has been increasing interest in using diverse sensing modalities for beam management. However, it remains a significant challenge to process multi-modal data efficiently and extract useful information. The recently emerging multi-modal transformer (MMT) is a promising technique that can process multi-modal data by capturing long-range dependencies. While MMT is highly effective in handling multi-modal data and providing robust beam management, integrating reinforcement learning (RL) further enhances its adaptability in dynamic environments. In this work, we propose a two-step beam management method that combines MMT with RL for dynamic beam index prediction. In the first step, we divide the available beam indices into several groups and leverage MMT to process diverse data modalities and predict the optimal beam group. In the second step, we employ RL for fast beam decision-making within each group, which in turn maximizes throughput. Our proposed framework is tested on a 6G dataset, where it achieves higher beam prediction accuracy and system throughput than both the MMT-only and the RL-only methods.
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- Asia > China (0.04)
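The two-step structure in the abstract maps naturally onto a coarse group classifier plus a per-group bandit. The sketch below is illustrative only: a placeholder stands in for the multi-modal transformer, and a simple epsilon-greedy bandit stands in for the RL agent.

```python
# A minimal sketch of the two-step idea: a (stand-in) multi-modal model picks a
# beam group, then an epsilon-greedy bandit refines the beam inside that group.
# All names and constants are illustrative assumptions.
import random

N_GROUPS, BEAMS_PER_GROUP = 4, 8

def mmt_predict_group(sensor_features) -> int:
    """Placeholder for the multi-modal transformer's group prediction."""
    return hash(tuple(sensor_features)) % N_GROUPS

class EpsilonGreedyBeam:
    """Per-group bandit that learns which beam index maximizes throughput."""
    def __init__(self, n_beams: int, eps: float = 0.1):
        self.q = [0.0] * n_beams   # running throughput estimates
        self.n = [0] * n_beams
        self.eps = eps

    def select(self) -> int:
        if random.random() < self.eps:
            return random.randrange(len(self.q))                 # explore
        return max(range(len(self.q)), key=self.q.__getitem__)   # exploit

    def update(self, beam: int, throughput: float):
        self.n[beam] += 1
        self.q[beam] += (throughput - self.q[beam]) / self.n[beam]

bandits = [EpsilonGreedyBeam(BEAMS_PER_GROUP) for _ in range(N_GROUPS)]
group = mmt_predict_group([0.3, 0.7, 0.1])   # step 1: coarse group choice
beam = bandits[group].select()               # step 2: fine beam choice
bandits[group].update(beam, throughput=random.random())  # simulated feedback
```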
Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification
Liu, Tengfei, Hu, Yongli, Gao, Junbin, Sun, Yanfeng, Yin, Baocai
Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents, such as texts and images, are not being effectively utilized. Prior studies in this area have attempted to integrate texts and images in document-related tasks, but they have only focused on short text sequences and images of pages. How to classify long documents with hierarchically structured text and embedded images is a new problem that poses multi-modal representation challenges. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and texts in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and the section and sentence features. Furthermore, we introduce a new interaction strategy, the dynamic mask transfer module, to integrate these two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets; the results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.34)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)
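The hierarchical fusion described above can be sketched as stacked cross-attention: sentence tokens attend to image tokens, then section tokens attend to the fused sentence tokens. A minimal PyTorch sketch, with all dimensions and module names assumed rather than taken from the paper:

```python
# A minimal sketch (not the authors' code) of hierarchical cross-modal fusion.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, context):
        fused, _ = self.attn(queries, context, context)  # queries attend to context
        return self.norm(queries + fused)                # residual + norm

sent_fuse = CrossModalBlock()   # sentence-level fusion with images
sect_fuse = CrossModalBlock()   # section-level fusion with fused sentences

imgs = torch.randn(1, 10, 256)      # 10 image tokens
sents = torch.randn(1, 120, 256)    # 120 sentence tokens
sections = torch.randn(1, 8, 256)   # 8 section tokens

sents = sent_fuse(sents, imgs)          # lower level: sentences <- images
sections = sect_fuse(sections, sents)   # higher level: sections <- sentences
```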
Towards Multi-modal Transformers in Federated Learning
Sun, Guangyu, Mendieta, Matias, Dutta, Aritra, Li, Xin, Chen, Chen
Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models without direct access to the raw data held by different clients. Despite its potential, a considerable research direction remains unexplored: FL with unpaired uni-modal clients and transformer architectures. To fill this gap, this paper explores a transfer multi-modal federated learning (MFL) scenario within the vision-language domain, where clients possess data of various modalities distributed across different datasets. We systematically evaluate the performance of existing methods when a transformer architecture is utilized and introduce a novel framework, Federated Modality Complementary and Collaboration (FedCola), that addresses the in-modality and cross-modality gaps among clients. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous approaches, offering new perspectives on the future federated training of multi-modal transformers.
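The setting itself, uni-modal clients sharing a transformer trunk while keeping modality-specific embedders local, is easy to sketch. The code below illustrates that setup with plain federated averaging of the shared trunk; it is not FedCola's method, which additionally addresses the in-modality and cross-modality gaps.

```python
# A minimal sketch of the transfer-MFL setting: uni-modal clients share a
# transformer trunk and keep modality-specific embedders local, so only trunk
# weights are averaged. Names and dimensions are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def make_client(is_vision: bool, trunk: nn.Module) -> nn.ModuleDict:
    # Modality-specific embedder: never leaves the client in this sketch.
    embed = nn.Linear(768 if is_vision else 512, 256)
    return nn.ModuleDict({"embed": embed, "trunk": copy.deepcopy(trunk)})

def fedavg_trunks(clients) -> dict:
    """Average only the shared trunk parameters across clients."""
    keys = clients[0]["trunk"].state_dict().keys()
    return {k: torch.stack([c["trunk"].state_dict()[k] for c in clients]).mean(0)
            for k in keys}

trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, 4, batch_first=True), num_layers=2)
clients = [make_client(True, trunk), make_client(False, trunk)]  # vision + text
avg = fedavg_trunks(clients)
for c in clients:                  # broadcast the averaged trunk back
    c["trunk"].load_state_dict(avg)
```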
Leveraging Transformers to Improve Breast Cancer Classification and Risk Assessment with Multi-modal and Longitudinal Data
Shen, Yiqiu, Park, Jungkyu, Yeung, Frank, Goldberg, Eliana, Heacock, Laura, Shamout, Farah, Geras, Krzysztof J.
Breast cancer screening, primarily conducted through mammography, is often supplemented with ultrasound for women with dense breast tissue. However, existing deep learning models analyze each modality independently, missing opportunities to integrate information across imaging modalities and time. In this study, we present Multi-modal Transformer (MMT), a neural network that utilizes mammography and ultrasound synergistically, to identify patients who currently have cancer and estimate the risk of future cancer for patients who are currently cancer-free. MMT aggregates multi-modal data through self-attention and tracks temporal tissue changes by comparing current exams to prior imaging. Trained on 1.3 million exams, MMT achieves an AUROC of 0.943 in detecting existing cancers, surpassing strong uni-modal baselines. For 5-year risk prediction, MMT attains an AUROC of 0.826, outperforming prior mammography-based risk models. Our research highlights the value of multi-modal and longitudinal imaging in cancer diagnosis and risk stratification.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New York > New York County > New York City (0.04)
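A hedged sketch of the aggregation idea above: image-level feature tokens from current mammography, current ultrasound, and prior exams are tagged with learned type embeddings and pooled by self-attention. All names and dimensions are assumptions, not the authors' architecture.

```python
# A minimal sketch of multi-modal, longitudinal aggregation in the spirit of
# the abstract; illustrative only.
import torch
import torch.nn as nn

class ExamAggregator(nn.Module):
    def __init__(self, dim: int = 256, n_types: int = 3):
        super().__init__()
        self.type_embed = nn.Embedding(n_types, dim)   # 0=mammo, 1=US, 2=prior
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)                  # cancer-probability logit

    def forward(self, tokens, token_types):
        x = tokens + self.type_embed(token_types)      # tag each token's source
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        return self.head(self.encoder(x)[:, 0])        # classify from CLS token

model = ExamAggregator()
tokens = torch.randn(2, 6, 256)                 # 6 image-level feature tokens
types = torch.tensor([[0, 0, 1, 1, 2, 2]] * 2)  # mammo, US, prior exams
logits = model(tokens, types)                   # shape (2, 1)
```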
Cross Attention Transformers for Multi-modal Unsupervised Whole-Body PET Anomaly Detection
Patel, Ashay, Tudosiu, Petru-Daniel, Pinaya, Walter H. L., Cook, Gary, Goh, Vicky, Ourselin, Sebastien, Cardoso, M. Jorge
Cancer is a highly heterogeneous condition that can occur almost anywhere in the human body. 18F-fluorodeoxyglucose (FDG) PET is an imaging modality commonly used to detect cancer due to its high sensitivity and clear visualisation of the pattern of metabolic activity. Nonetheless, given this heterogeneity, it is challenging to train general-purpose discriminative cancer detection models, with data availability and disease complexity often cited as limiting factors. Unsupervised anomaly detection models have been suggested as a putative solution. These models learn a healthy representation of tissue and detect cancer by predicting deviations from the healthy norm, which requires models capable of accurately learning long-range interactions between organs and their imaging patterns with high levels of expressivity. Such characteristics are suitably satisfied by transformers, which have been shown to generate state-of-the-art results in unsupervised anomaly detection by training on normal data. This work expands upon such approaches by introducing multi-modal conditioning of the transformer via cross-attention, i.e., supplying anatomical reference from paired CT. Using 294 whole-body PET/CT samples, we show that our anomaly detection method is robust and capable of achieving accurate cancer localization results even in cases where normal training data is unavailable. In addition, we show the efficacy of this approach on out-of-sample data, showcasing its generalizability with limited training data. Lastly, we propose to combine model uncertainty with a new kernel density estimation approach, and show that it provides clinically and statistically significant improvements over classic residual-based anomaly maps. Overall, superior performance is demonstrated against leading state-of-the-art alternatives, drawing attention to the potential of these approaches.
- Europe > United Kingdom (0.28)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Europe > Netherlands > Drenthe > Assen (0.04)
- (3 more...)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
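The cross-attention conditioning can be sketched with a standard transformer decoder, where PET tokens query paired CT tokens for anatomical context and a residual map scores anomalies. This is an illustration of the mechanism, not the paper's generative model; token counts and dimensions are assumptions.

```python
# A minimal sketch of the conditioning idea: PET tokens cross-attend to paired
# CT tokens, anchoring the model's notion of "healthy" to anatomy.
import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

pet_tokens = torch.randn(1, 128, 256)   # tokenized PET volume (assumed 128 tokens)
ct_tokens = torch.randn(1, 128, 256)    # paired CT volume as conditioning memory

# Cross-attention enters through the decoder's memory input: each PET token
# queries the CT tokens for anatomical context before self-attending to PET.
healthy_recon = decoder(tgt=pet_tokens, memory=ct_tokens)

# A residual-style anomaly map: large deviations from the "healthy"
# reconstruction flag candidate lesions (the paper augments this with KDE).
anomaly_score = (pet_tokens - healthy_recon).abs().mean(-1)  # (1, 128)
```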
Top 5 Interview Questions on Multi-modal Transformers
Until recently, developing new, improved transformers for a single modality was common practice. To tackle real-world tasks, however, there was a pressing need for multi-modal transformer models: models that learn representations from different modalities, such as images, text, and audio, using a single model. Multi-modal learning models the combination of different modalities of data, which often arises in real-world applications.
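To make the definition concrete, here is a minimal sketch of "one model, several modalities": image patches and text tokens are projected into a shared embedding space and processed by a single transformer encoder. All dimensions and vocabulary sizes are illustrative.

```python
# A minimal sketch of joint multi-modal processing; illustrative only.
import torch
import torch.nn as nn

img_proj = nn.Linear(3 * 16 * 16, 256)   # flattened 16x16 RGB patches -> shared dim
txt_embed = nn.Embedding(30522, 256)     # text vocabulary -> shared dim
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, 4, batch_first=True), num_layers=2)

patches = torch.randn(1, 196, 3 * 16 * 16)    # 14x14 grid of image patches
text_ids = torch.randint(0, 30522, (1, 32))   # 32 text tokens

tokens = torch.cat([img_proj(patches), txt_embed(text_ids)], dim=1)
out = encoder(tokens)   # one model attends across both modalities jointly
```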
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
Crisol, Tomás, Ermantraut, Joel, Rostagno, Adrián, Aggio, Santiago L., Iparraguirre, Javier
In recent years, transformer architectures have grown in popularity. Modulated Detection Transformer (MDETR) is an end-to-end multi-modal understanding model that performs tasks such as phrase grounding, referring expression comprehension, referring expression segmentation, and visual question answering. One remarkable aspect of the model is its capacity to infer over classes it was not previously trained on. In this work we explore the use of MDETR in a new task, action detection, without any additional training. We obtain quantitative results using the Atomic Visual Actions dataset. Although the model does not achieve the best performance on the task, we believe this is an interesting finding: we show that it is possible to use a multi-modal model to tackle a task it was not designed for. Finally, we believe this line of research may lead to the generalization of MDETR to additional downstream tasks.
- South America > Argentina (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
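The experiment's shape, querying a pretrained text-conditioned detector with an action phrase it was never trained on, can be sketched as below. The hub entry point and two-stage forward call follow the MDETR repository's demo; treat the exact names as assumptions and check the repo if they have changed. "frame.jpg" is a hypothetical video frame.

```python
# A hedged sketch of zero-shot action querying with a pretrained MDETR model.
import torch
from PIL import Image
import torchvision.transforms as T

# Entry point as documented in the ashkamath/mdetr repository (assumed current).
model, postprocessor = torch.hub.load(
    "ashkamath/mdetr:main", "mdetr_resnet101",
    pretrained=True, return_postprocessor=True)
model.eval()

transform = T.Compose([
    T.Resize(800), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
img = transform(Image.open("frame.jpg").convert("RGB")).unsqueeze(0)  # hypothetical frame

caption = "a person standing up"  # the free-text action query; no action training
with torch.no_grad():
    # Two-stage forward pass, following the repository's demo notebook.
    memory_cache = model(img, [caption], encode_and_save=True)
    outputs = model(img, [caption], encode_and_save=False, memory_cache=memory_cache)

# Confidence that each predicted box matches (part of) the caption.
scores = 1 - outputs["pred_logits"].softmax(-1)[0, :, -1]
boxes = outputs["pred_boxes"][0, scores > 0.7]  # normalized cxcywh boxes
```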